Note 8 B-Trees

Author: Zhen Tong 120090694@link.cuhk.edu.cn Lecturer: Jane YOU

CSC 3170 - Database System at CUHKSZ. This is an introductory course in database systems.

https://58191554.github.io/csc3170_23fall/dashboard.html

Back to Dashboard

B^+

tree structure

Motivation: The deficiency of index order organization is that when the file gets bigger, both the sequential search and index search costs get huge. We can reorganize the file and update indexing for the file, however, it needs extra time to reorganize, and during the reorganization, every operation in the database is pending. We need to find a better way to deal with the file size increasing problem.

Balance Tree Structure (A tree structure is multi-lay indexing)

The distance from the root to every leaf is the same.

Every non-leaf node has n/2 to n number of child nodes. (n is the fixed hyperparameter of the tree predefined. Order of the tree)

B+Tree structure：

The node structure：

One node has at most n number of pointer $[P_1,...,P_n]$ , and at most n-1 number of search key $[K_1,...,K_{n-1}]$ . The search key is sequentially sorted, therefore if $i<j⇒K_i<K_j$

一个节点最多有n个pointer $[P_1,...,P_n]$ 最多有n-1个搜索码的值 $[K_1,...,K_{n-1}]$ 。一个节点的搜索码值是顺序存放的因此，如果 $i<j⇒K_i<K_j$ .

叶节点结构 The leaf node structure：
Each leaf has at most n-1 search key, at least $\lceil{(n-1)/2}\rceil$ number of search key.
- For example when n = 4, then the leaf node has at most n-1 search key, and at least $\lceil{(n-1)/2}\rceil$ search key.
The last pointer is linked to the next leaf node.
For every two leaf nodes, if $L_i$ is on the left side of $L_j$ , $i.e.\ i<j ⇒$ every search key value in node i is less then all the search key value in node j
If the B+ tree’s index is used for the dense index, then every search key in the B+ tree should appear in somewhere some leaf node

非叶子节点结构 Internal Leaf：
Every node above the leaf node is a multi-layer sparse indexing.
The internal leaf can contain at most n pointers, and at least $\lceil{n/2}\rceil$ pointers. All the pointers point to the lower layer node.
Point $P_i$ points to the sub-tree whose search key is all smaller than $K_i$ , and larger or equal to the $K_{i-1}$ . You can think $K_i$ as a clapboard.
Root node： At least 2 child node, and at most n child node

B^+

tree performance

An internal node has $m = [\lceil n/2\rceil, n]$ numbers of pointers. For a tree with $depth = h-1$ , there are $m^{h-1}$ numbers of leaf nodes, which means the search key number is $K = m^h$ . The maximum depth $h = \lceil log_{\lceil{n/2}\rceil}K \rceil$

For example, when the B+tree’s node size is designed as the size of a block of disk (4KB), and when n = 100, with $10^6$ numbers of search key. The B+ tree is with depth $h = log_{100/2}(10^6) \approx 4$ . Therefore, to search an element you want, you only go through 4 depths.

Average Storage Utilization of B-Tree平均存储利用率

$K=$ the total number of the search keys of the tree

$n$ = the maximum volume of a node, which means a node can contain n/2 to n elements

$N$ = the node number in a tree, which changes during your DBMS operation.

$f$ is the fullness factor

The volume for any tree is $N\times n$

$ρ_{(\text{Storage Utilization})}=$ $\frac{K}{N\times n}$

When f = 1, the tree is full, which means the nodes are fully utilized, and the number of nodes reaches its minimum $N_{min} = \frac{K}{n}$ ，

When f = 1/2, the tree is half-full, every node is taking its most wasteful utilization that half of the space is not used. The number of nodes is maximum, $N\in [\frac{K}{n}, \frac{2K}{n}]$

If we take the number of nodes in a tree $N$ as a random variable, and assume the RV

$N\sim\mathbb{U}[N_{min}, N_{max}]$ in a continuous uniform distribution, then

$E(\rho) = E(\frac{K}{N\times n} ) = \frac{K}{n}E(1/N)$

$N\in \left[ \dfrac{k}{n},\dfrac{k}{nf}\right] \cdot f=\dfrac{3}{4}\\ E\left( \rho \right) =E\left( \dfrac{k}{Nn}\right) =\dfrac{K}{n}E\left( \dfrac{1}{N}\right) \\ =\dfrac{k}{n}\times \dfrac{nf}{k\left( 1-f\right) }\int _{\frac{k}{n}}^{\frac{k}{nf}}\left( \dfrac{1}{t}\right) dt=\dfrac{f}{1-f}\ln \dfrac{1}{f}\\$

Use fullness factor $f$ to generalize the general form:

$N\in[\frac{K}{n}, \frac{K}{n\times f}]$ , (For the B*-Tree, f = ⅔)

$E(\rho) = E(\frac{K}{Nn}) = (\frac{K}{n})E(\frac{1}{N})$

$f^\prime = （1-f）$

The CDF of $ρ$ is :

The pdf of $\rho$ is:

The variance of the Storage Utilization $\rho$

B^+

tree update

After we insert or delete some element in the leaf node, the leaf node can be too big, or too small. We need to break up, or merge.

When we need to insert or delete some element, we first need to search where the element is, to ensure the updated tree with a sorted search key, as before.

When we delete some search key, we need to remove all the elements one step to the left, to ensure there is no gap in between.

Insertion
- Take the leaf node apart
  add search key = Adams
  After we insert K=Adams, the number of pointers in that leaf node exceeds the maximum, we need to break down the leaf node.
- non-leaf node break down
  When we insert search K=Lamport
  The leaf node of [Gold， Katz , Kim] is the place to insert Lamport. The leaf node meets maximum, and needs to break down
  After we break down the leaf node, a new leaf node is created, and we need to add a parent node for Kim. At the same time, Kim’s parent node needs to add a new pointer to Kim’s leaf node.

Delete
When we want to delete the search key $K =$ Srinivasan, The parent node of Srinivasan needs to change its search key from Srinivasan to something else. We look up to the grand parent’s node and take Mozart as the new parent. After that, we need to merge the leaf node.
When we want to delete “Singh” and “Wu”
Mozart’s leaf node is less than the minimum(<n/2), and in need of merging other leaves
If we delete Gold this time, the depth decreases. The second level node becomes the root node. There is one search Key called Gold in the last, however, it doesn’t point to any available search key. It is still useful because it is a mark for the sorted mark for the following leaf node [Katz, Kim, Mozart]

The cost of inserting and removing a single entry (based on the number of I/O operations) is proportional to the height of the tree $O(log\lceil {n/2}\rceil (K))$

区别

B+Tree File Organization

The leaf node contains a pointer and search key, and the internal node point to the child node, not to the file record.

B-Tree Index Files

pointer point to record not only in the leaf node.指向非叶子节点的record

B*-Tree

A B+ tree with f =⅔，is a B*-Tree

...